Homework 02

Customer segmentation and regression problems

BDA - University of Tartu - Spring 2020

Homework instructions

  • Insert your team member names and student IDs in the field "Team mates" below. If you are not working in a team please insert only your name, surname and student ID

  • The accepted submission formats are Colab links or .ipynb files. If you are submitting Colab links please make sure that the privacy settings for the file is public so we can access your code.

  • The submission will automatically close at 12:00 am, so please make sure you have enough time to submit the homework.

  • Only one of the teammates should submit the homework. We will grade and give points to both of you!

  • You do not necessarily need to work on Colab. Especially as the size and the complexity of datasets will increase through the course, you can install jupyter notebooks locally and work from there.

  • If you do not understand what a question is asking for, please ask in Moodle.

Team mates:

Name Surname: Enlik -Student ID: B96323

Name Surname: XXXXXStudent ID: YYYY

In [1]:
## To resolve notebook json issue related to Plotly
# !pip install nbformat=4.4.0
In [2]:
import numpy as np 
import pandas as pd 
import datetime as dt 

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.offline as pyoff
import plotly.graph_objs as go
from plotly.subplots import make_subplots

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from scipy.stats import norm


## Ignore warning message
import warnings

warnings.simplefilter('ignore')

1. RFM technique (3 points)

In this homework we will apply the RFM technique similarly to what we did in the practice session. The dataset we are using belongs to http://donorschoose.org/.

The web platform http://donorschoose.org helps projects in collecting donations. The donators do have the option to give 15% of what they donate to this webplatform. The dataset contains the following fields:

  • Donor ID - Unique identifier of donors
  • Donation Included Optional Donation - The donor has agreed(Yes/No) to give 15% of donation amount to Donoschoose.org
  • Donation Amount - Total amount donated
  • Donor Cart Sequence - The ranking of the donaton project in the donors list
  • Donation Received Date - Date of online payment

1.0. Please load the dataset Donations.csv. The field "Donation Included Optional Donation" shows if the donor has agreed (Yes/No) to give 15% of donation amount to Donoschoose.org. Update the field "Donation Amount" as such that it will show the correct amount the donor has given for the charity projects in case where the answer is Yes to "Donation Included Optional Donation". (0.30 points)

In [3]:
donations = pd.read_csv("Donations.csv", sep = " ")
In [4]:
donations.head()
Out[4]:
Donor ID Donation Included Optional Donation Donation Amount Donor Cart Sequence Donation Received Date
0 1f4b5b6e68445c6c4a0509b3aca93f38 No 178.37 11 2016-08-23 13:15:57
1 bd323208dc78b1c74b62664b768f3176 Yes 200.00 2 2013-02-17 21:36:24
2 6dd6113f89f2766d3b0707ef2a46260c Yes 10.00 44 2013-02-27 10:32:22
3 344ad0a72366a27bd4926bbe5d066939 Yes 10.00 9 2017-07-24 08:40:35
4 3cf2607d4131cd654bf54a1f5a533357 Yes 25.00 5 2017-07-11 16:40:30
In [5]:
# OLD LOGIC <-- Take long time to perform, not optimized!

# def updateDonationAmount(isIncluded, amount):
#     if isIncluded == "Yes":
#         return amount - (0.15 * amount)
#     else:
#         return amount

# donations['Donation Amount'] = donations.apply(
#     lambda x: updateDonationAmount(x['Donation Included Optional Donation'], x['Donation Amount']), axis=1)
In [6]:
# NEW LOGIC

donations.loc[
    donations['Donation Included Optional Donation'] == 
    'Yes', 'Donation Amount'] = donations['Donation Amount'] - (0.15 * donations['Donation Amount'])
In [7]:
donations.head()
Out[7]:
Donor ID Donation Included Optional Donation Donation Amount Donor Cart Sequence Donation Received Date
0 1f4b5b6e68445c6c4a0509b3aca93f38 No 178.37 11 2016-08-23 13:15:57
1 bd323208dc78b1c74b62664b768f3176 Yes 170.00 2 2013-02-17 21:36:24
2 6dd6113f89f2766d3b0707ef2a46260c Yes 8.50 44 2013-02-27 10:32:22
3 344ad0a72366a27bd4926bbe5d066939 Yes 8.50 9 2017-07-24 08:40:35
4 3cf2607d4131cd654bf54a1f5a533357 Yes 21.25 5 2017-07-11 16:40:30

1.1. For Recency, consider 2018-05-31 23:00:00 as the reference date (we assume this is report date). Calculate the number of days between this date and the date of last purchase for each customer. How long the donor with ID equal to e0fe4d9b8def8a71635e65ba4ff5ef40 has been inactive ? (0.30 points)

In [8]:
# Reference
# https://stackoverflow.com/questions/50089903/convert-column-to-timestamp-pandas-dataframe

# Convert column 'Donation Received Date' into date type (timestamp)
donations['Donation Received Date'] = pd.to_datetime(donations['Donation Received Date'], format = "%Y-%m-%d %H:%M:%S")
In [9]:
# Create a generic user dataframe to keep Donor ID and new segmentation scores (RFM)
donations_user = pd.DataFrame(donations['Donor ID'].unique())
donations_user.columns = ['Donor ID']
print(donations_user)
                                Donor ID
0       1f4b5b6e68445c6c4a0509b3aca93f38
1       bd323208dc78b1c74b62664b768f3176
2       6dd6113f89f2766d3b0707ef2a46260c
3       344ad0a72366a27bd4926bbe5d066939
4       3cf2607d4131cd654bf54a1f5a533357
...                                  ...
129060  7c3a054108a9eebb441054450d76a9d3
129061  da71310482043b4da3e1bb1ab80850e8
129062  d1e1e4fc0975a7936940a3708f1860c9
129063  ead077ee750c0a4d2190afdb202a6343
129064  56729cfddae7e622f7d49ba5877c4e5c

[129065 rows x 1 columns]
In [10]:
# Get the last purchase date for each donors and create a dataframe from it 
lastPurchaseDate = donations.groupby('Donor ID')['Donation Received Date'].max().reset_index()
lastPurchaseDate.columns = ['Donor ID', 'LastPurchaseDate']
print(lastPurchaseDate)
                                Donor ID    LastPurchaseDate
0       00002d44003ed46b066607c5455a999a 2017-10-18 14:34:11
1       00002eb25d60a09c318efbd0797bffb5 2018-01-16 15:32:41
2       00006084c3d92d904a22e0a70f5c119a 2018-04-14 23:04:42
3       0001107b9faa5c3bb42cfcecece1d587 2018-03-24 19:43:59
4       0001f63e9437ebbba6ddaae0664037a7 2017-03-09 07:10:03
...                                  ...                 ...
129060  fffc8063fefd271719cb27cd019de746 2016-09-01 22:40:57
129061  fffcffed2771863a61c9669049cc6876 2013-11-25 23:49:34
129062  fffd67203dc580485e30d8a8820ef2bf 2018-04-25 18:50:51
129063  fffd69c0f1dc11f7fc10d595891723a2 2018-03-15 00:38:54
129064  fffe56ecd1382490b5995a308d9476d9 2015-08-05 17:25:37

[129065 rows x 2 columns]
In [11]:
# create variable reportDate based on mentioned reference date above
reportDate = dt.datetime.strptime('2018-05-31 23:00:00', '%Y-%m-%d %H:%M:%S')

lastPurchaseDate['Recency'] = (reportDate - lastPurchaseDate['LastPurchaseDate']).dt.days
print(lastPurchaseDate)
                                Donor ID    LastPurchaseDate  Recency
0       00002d44003ed46b066607c5455a999a 2017-10-18 14:34:11      225
1       00002eb25d60a09c318efbd0797bffb5 2018-01-16 15:32:41      135
2       00006084c3d92d904a22e0a70f5c119a 2018-04-14 23:04:42       46
3       0001107b9faa5c3bb42cfcecece1d587 2018-03-24 19:43:59       68
4       0001f63e9437ebbba6ddaae0664037a7 2017-03-09 07:10:03      448
...                                  ...                 ...      ...
129060  fffc8063fefd271719cb27cd019de746 2016-09-01 22:40:57      637
129061  fffcffed2771863a61c9669049cc6876 2013-11-25 23:49:34     1647
129062  fffd67203dc580485e30d8a8820ef2bf 2018-04-25 18:50:51       36
129063  fffd69c0f1dc11f7fc10d595891723a2 2018-03-15 00:38:54       77
129064  fffe56ecd1382490b5995a308d9476d9 2015-08-05 17:25:37     1030

[129065 rows x 3 columns]
In [12]:
# Merge into main dataframe
donations_user = pd.merge(donations_user, lastPurchaseDate[['Donor ID','Recency']], on='Donor ID')
print(donations_user)
                                Donor ID  Recency
0       1f4b5b6e68445c6c4a0509b3aca93f38      126
1       bd323208dc78b1c74b62664b768f3176     1012
2       6dd6113f89f2766d3b0707ef2a46260c     1025
3       344ad0a72366a27bd4926bbe5d066939      242
4       3cf2607d4131cd654bf54a1f5a533357      133
...                                  ...      ...
129060  7c3a054108a9eebb441054450d76a9d3       93
129061  da71310482043b4da3e1bb1ab80850e8     1871
129062  d1e1e4fc0975a7936940a3708f1860c9      498
129063  ead077ee750c0a4d2190afdb202a6343      289
129064  56729cfddae7e622f7d49ba5877c4e5c     1668

[129065 rows x 2 columns]
In [13]:
# plot a recency histogram distribution of recency across our donors.
# reference: BDA2020 - lab03 practice material
# https://drive.google.com/file/d/1bTAKH9MOyf3Wg5qZ3nrjNAqD_plwGt4H/view

plot_data = [
    go.Histogram(
        x=donations_user['Recency']
    )
]

plot_layout = go.Layout(
        title='Distribution of recency across our Donor.'
    )

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
In [14]:
donations_user[donations_user['Donor ID'] == "e0fe4d9b8def8a71635e65ba4ff5ef40"]
Out[14]:
Donor ID Recency
67193 e0fe4d9b8def8a71635e65ba4ff5ef40 34

Answer: 34 days

1.2. For the "Frequency" calculate how many times a user has donated. What is the frequency value for the donor with ID e0fe4d9b8def8a71635e65ba4ff5ef40? (0.30 points)

In [15]:
# create a dataframe with frequency column based on total count of Donation for every donor ID
donations_frequency = donations.groupby('Donor ID')['Donation Received Date'].count().reset_index()
donations_frequency.columns = ['Donor ID', 'Frequency']
donations_frequency
Out[15]:
Donor ID Frequency
0 00002d44003ed46b066607c5455a999a 11
1 00002eb25d60a09c318efbd0797bffb5 5
2 00006084c3d92d904a22e0a70f5c119a 18
3 0001107b9faa5c3bb42cfcecece1d587 47
4 0001f63e9437ebbba6ddaae0664037a7 7
... ... ...
129060 fffc8063fefd271719cb27cd019de746 5
129061 fffcffed2771863a61c9669049cc6876 5
129062 fffd67203dc580485e30d8a8820ef2bf 5
129063 fffd69c0f1dc11f7fc10d595891723a2 14
129064 fffe56ecd1382490b5995a308d9476d9 8

129065 rows × 2 columns

In [16]:
# merge the new dataframe into our new user dataframe
donations_user = pd.merge(donations_user, donations_frequency, on="Donor ID")
donations_user
Out[16]:
Donor ID Recency Frequency
0 1f4b5b6e68445c6c4a0509b3aca93f38 126 314
1 bd323208dc78b1c74b62664b768f3176 1012 5
2 6dd6113f89f2766d3b0707ef2a46260c 1025 27
3 344ad0a72366a27bd4926bbe5d066939 242 9
4 3cf2607d4131cd654bf54a1f5a533357 133 8
... ... ... ...
129060 7c3a054108a9eebb441054450d76a9d3 93 6
129061 da71310482043b4da3e1bb1ab80850e8 1871 5
129062 d1e1e4fc0975a7936940a3708f1860c9 498 6
129063 ead077ee750c0a4d2190afdb202a6343 289 10
129064 56729cfddae7e622f7d49ba5877c4e5c 1668 5

129065 rows × 3 columns

In [17]:
donations_user['Frequency'].describe()
Out[17]:
count    129065.000000
mean         16.810398
std         112.850319
min           5.000000
25%           6.000000
50%           8.000000
75%          12.000000
max       18035.000000
Name: Frequency, dtype: float64
In [18]:
# plot a frequency histogram distribution of recency across our donors.
plot_data = [
    go.Histogram(
        x=donations_user.query('Frequency < 100')['Frequency']
    )
]

plot_layout = go.Layout(
        title='Distribution of Frequency across our Donors.'
    )

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
In [19]:
donations_frequency[donations_frequency['Donor ID'] == 'e0fe4d9b8def8a71635e65ba4ff5ef40']
Out[19]:
Donor ID Frequency
113469 e0fe4d9b8def8a71635e65ba4ff5ef40 9

Answer: 9

1.3. For the "Revenue" calculate how much a donor has donated. What is the revenue value for the donor with ID e0fe4d9b8def8a71635e65ba4ff5ef40? (0.30 points)

In [20]:
donations.head()
Out[20]:
Donor ID Donation Included Optional Donation Donation Amount Donor Cart Sequence Donation Received Date
0 1f4b5b6e68445c6c4a0509b3aca93f38 No 178.37 11 2016-08-23 13:15:57
1 bd323208dc78b1c74b62664b768f3176 Yes 170.00 2 2013-02-17 21:36:24
2 6dd6113f89f2766d3b0707ef2a46260c Yes 8.50 44 2013-02-27 10:32:22
3 344ad0a72366a27bd4926bbe5d066939 Yes 8.50 9 2017-07-24 08:40:35
4 3cf2607d4131cd654bf54a1f5a533357 Yes 21.25 5 2017-07-11 16:40:30
In [21]:
# Calculate revenue for each donors based on their total donation amount
donations_revenue = donations.groupby('Donor ID')['Donation Amount'].sum().reset_index()
donations_revenue.columns = ['Donor ID', 'Revenue']
donations_revenue
Out[21]:
Donor ID Revenue
0 00002d44003ed46b066607c5455a999a 597.6075
1 00002eb25d60a09c318efbd0797bffb5 178.5000
2 00006084c3d92d904a22e0a70f5c119a 420.7500
3 0001107b9faa5c3bb42cfcecece1d587 840.5000
4 0001f63e9437ebbba6ddaae0664037a7 239.7000
... ... ...
129060 fffc8063fefd271719cb27cd019de746 106.2500
129061 fffcffed2771863a61c9669049cc6876 212.5000
129062 fffd67203dc580485e30d8a8820ef2bf 182.7500
129063 fffd69c0f1dc11f7fc10d595891723a2 387.5465
129064 fffe56ecd1382490b5995a308d9476d9 2126.3100

129065 rows × 2 columns

In [22]:
# Merge with our main dataframe
donations_user = pd.merge(donations_user, donations_revenue, on="Donor ID")
donations_user
Out[22]:
Donor ID Recency Frequency Revenue
0 1f4b5b6e68445c6c4a0509b3aca93f38 126 314 139170.7300
1 bd323208dc78b1c74b62664b768f3176 1012 5 544.0000
2 6dd6113f89f2766d3b0707ef2a46260c 1025 27 519.3500
3 344ad0a72366a27bd4926bbe5d066939 242 9 344.8265
4 3cf2607d4131cd654bf54a1f5a533357 133 8 198.7500
... ... ... ... ...
129060 7c3a054108a9eebb441054450d76a9d3 93 6 102.0000
129061 da71310482043b4da3e1bb1ab80850e8 1871 5 148.7500
129062 d1e1e4fc0975a7936940a3708f1860c9 498 6 191.2500
129063 ead077ee750c0a4d2190afdb202a6343 289 10 178.5000
129064 56729cfddae7e622f7d49ba5877c4e5c 1668 5 319.6000

129065 rows × 4 columns

In [23]:
donations_user.describe()
Out[23]:
Recency Frequency Revenue
count 129065.000000 129065.000000 1.290650e+05
mean 418.500631 16.810398 9.495786e+02
std 422.033947 112.850319 8.456215e+03
min 22.000000 5.000000 5.100000e-02
25% 107.000000 6.000000 1.775225e+02
50% 253.000000 8.000000 3.287545e+02
75% 592.000000 12.000000 6.800000e+02
max 1972.000000 18035.000000 1.727650e+06
In [24]:
# Plot the histogram
plot_data = [
    go.Histogram(
        x=donations_user.query('Revenue < 1000')['Revenue']
    )
]

plot_layout = go.Layout(
        title='Monetary Value'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
In [25]:
donations_user[donations_user['Donor ID'] == 'e0fe4d9b8def8a71635e65ba4ff5ef40']
Out[25]:
Donor ID Recency Frequency Revenue
67193 e0fe4d9b8def8a71635e65ba4ff5ef40 34 9 708.6705

Answer: 708.6705

1.4. Use the quantiles to separate Recency, Frequency, Revenue into four bins. We define our high value customers which have low Recency value and high Frequency and Revenue values. Calculate score for each donor which is equal to sum of bin values as a RFM components and save it in a new column. (0.50 points)

In [26]:
RFM = donations_user
RFM = RFM.set_index('Donor ID')
RFM.head()
Out[26]:
Recency Frequency Revenue
Donor ID
1f4b5b6e68445c6c4a0509b3aca93f38 126 314 139170.7300
bd323208dc78b1c74b62664b768f3176 1012 5 544.0000
6dd6113f89f2766d3b0707ef2a46260c 1025 27 519.3500
344ad0a72366a27bd4926bbe5d066939 242 9 344.8265
3cf2607d4131cd654bf54a1f5a533357 133 8 198.7500
In [27]:
fig, ax = plt.subplots()

for i, col in enumerate(RFM.columns):
    ax=fig.add_subplot(1, 3, i+1)
    sns.distplot(RFM[col],fit=norm, kde=False,ax=ax, color='b')
    plt.title(col);
    plt.axis('off')
    plt.tight_layout()
In [28]:
rfm = RFM
rfm['r_quartile'] = pd.qcut(rfm['Recency'], 4, ['1','2','3','4'])
rfm['f_quartile'] = pd.qcut(rfm['Frequency'], 4, ['4','3','2','1'])
rfm['m_quartile'] = pd.qcut(rfm['Revenue'], 4, ['4','3','2','1'])
rfm.head()
Out[28]:
Recency Frequency Revenue r_quartile f_quartile m_quartile
Donor ID
1f4b5b6e68445c6c4a0509b3aca93f38 126 314 139170.7300 2 1 1
bd323208dc78b1c74b62664b768f3176 1012 5 544.0000 4 4 2
6dd6113f89f2766d3b0707ef2a46260c 1025 27 519.3500 4 1 2
344ad0a72366a27bd4926bbe5d066939 242 9 344.8265 2 2 2
3cf2607d4131cd654bf54a1f5a533357 133 8 198.7500 2 3 3
In [29]:
rfm.r_quartile.unique()
Out[29]:
[2, 4, 1, 3]
Categories (4, object): [1 < 2 < 3 < 4]
In [30]:
rfm.f_quartile.unique()
Out[30]:
[1, 4, 2, 3]
Categories (4, object): [4 < 3 < 2 < 1]
In [31]:
rfm.m_quartile.unique()
Out[31]:
[1, 2, 3, 4]
Categories (4, object): [4 < 3 < 2 < 1]
In [32]:
# Calculate RFM Score
# rfm['RFM_Score'] = rfm.r_quartile.astype(str)+ rfm.f_quartile.astype(str) + rfm.m_quartile.astype(str)
rfm['RFM_Score'] = rfm.r_quartile.astype(int)+ rfm.f_quartile.astype(int) + rfm.m_quartile.astype(int)
rfm.head(5)
Out[32]:
Recency Frequency Revenue r_quartile f_quartile m_quartile RFM_Score
Donor ID
1f4b5b6e68445c6c4a0509b3aca93f38 126 314 139170.7300 2 1 1 4
bd323208dc78b1c74b62664b768f3176 1012 5 544.0000 4 4 2 10
6dd6113f89f2766d3b0707ef2a46260c 1025 27 519.3500 4 1 2 7
344ad0a72366a27bd4926bbe5d066939 242 9 344.8265 2 2 2 6
3cf2607d4131cd654bf54a1f5a533357 133 8 198.7500 2 3 3 8
In [33]:
# Filter out Top/Best donors
# RFM_Score = '111' means our best donors' score
# rfm[rfm['RFM_Score']=='111'].sort_values('Revenue', ascending=False).head()
rfm[rfm['RFM_Score']==12].sort_values('Revenue', ascending=False).head()
Out[33]:
Recency Frequency Revenue r_quartile f_quartile m_quartile RFM_Score
Donor ID
1394f27e6548abd3da1a3622a85159e5 608 6 177.5 4 4 4 12
27737372f14c961f7ca6197bbb13a513 635 6 177.5 4 4 4 12
7addbd30638a44eb0eea8727174b0544 755 6 177.5 4 4 4 12
20398e26c07b8f4cf37b36d40050b285 636 5 177.5 4 4 4 12
6d6777b4d7e6b84fa36db74cf3deb033 858 5 177.5 4 4 4 12

1.5. Build a 3D graph where axes represent the R, F and, M values and the hue (color of shade) represents the score. Based on the graph on how many categories would you cluster the donors? Please explain your decision. (0.50 points)

In [34]:
import plotly.express as px
dff = rfm
fig = px.scatter_3d(dff, x='Revenue', y='Frequency', z='Recency',
              color='RFM_Score', size_max=20,opacity=0.7)
fig.show()

Answer:

Based on 3d graph above, I'll cluster the donors into four categories. The reason is because of the the group of same colors on bottom side (dark blue), middle side (purple and orange), and top side (yellow).

1.6. Interpret what the minimum, maximum and mean score indicates about your donors ?(0.50 points)

In [35]:
rfm['RFM_Score'] = rfm['RFM_Score'].astype(int)
rfm['RFM_Score'].describe()
Out[35]:
count    129065.000000
mean          7.724821
std           2.559632
min           3.000000
25%           6.000000
50%           8.000000
75%          10.000000
max          12.000000
Name: RFM_Score, dtype: float64

Answer:

  • minimum = 3, it means that these are our lowest performance or worst donors who have the highest recency, lowest frequency and revenues
  • maximum = 12, it means that these are our top or best donors who have the lowest recency, highest frequency and revenues
  • mean = 7.72, it means that the average RFM_Score of our donors are around 7-8

1.7. Based on your answer in the question 1.5, please threshold the scores to create as many clusters as you think is reasonable. Assign integer values to those clusters to ease the identification and add a new column to your RFM data frame which shows the cluster each record belongs to. (0.30 points)

In [36]:
rfm.loc[rfm['RFM_Score'] <= 12, 'ClusterType'] = 1 # the first-medium value
rfm.loc[rfm['RFM_Score'] <= 8, 'ClusterType'] = 2 # the second-medium value
rfm.loc[rfm['RFM_Score'] < 7, 'ClusterType'] = 3 # the highest value
rfm.loc[rfm['RFM_Score'] == 3, 'ClusterType'] = 4 # the lowest value

rfm.head()
Out[36]:
Recency Frequency Revenue r_quartile f_quartile m_quartile RFM_Score ClusterType
Donor ID
1f4b5b6e68445c6c4a0509b3aca93f38 126 314 139170.7300 2 1 1 4 3.0
bd323208dc78b1c74b62664b768f3176 1012 5 544.0000 4 4 2 10 1.0
6dd6113f89f2766d3b0707ef2a46260c 1025 27 519.3500 4 1 2 7 2.0
344ad0a72366a27bd4926bbe5d066939 242 9 344.8265 2 2 2 6 3.0
3cf2607d4131cd654bf54a1f5a533357 133 8 198.7500 2 3 3 8 2.0

2. Automated clustering algorithms (4.1 points)

2.1. Use K-Means algorithm with parameter "max_iter = 1", to get the same number of clusters for RFM as you selected in the exercise 1.5. Feed the RFM values at the same time to the algorithm (not separately). Store your result in a new column and name this column as Kmeans_cluster.(0.50 points)

Rescaling the Attributes

It is extremely important to rescale the variables so that they have a comparable scale.| There are two common ways of rescaling:

Min-Max scaling
Standardisation (mean-0, sigma-1)

Here, we will use Standardisation Scaling.

Reference: Kaggle

In [37]:
rfm_df = rfm[['Recency', 'Frequency', 'Revenue']]
rfm_df.head()
Out[37]:
Recency Frequency Revenue
Donor ID
1f4b5b6e68445c6c4a0509b3aca93f38 126 314 139170.7300
bd323208dc78b1c74b62664b768f3176 1012 5 544.0000
6dd6113f89f2766d3b0707ef2a46260c 1025 27 519.3500
344ad0a72366a27bd4926bbe5d066939 242 9 344.8265
3cf2607d4131cd654bf54a1f5a533357 133 8 198.7500
In [38]:
# Instantiate
scaler = StandardScaler()

# fit_transform
rfm_df_scaled = scaler.fit_transform(rfm_df)
rfm_df_scaled.shape
Out[38]:
(129065, 3)
In [39]:
rfm_df_scaled = pd.DataFrame(rfm_df_scaled)
rfm_df_scaled.columns = ['Amount', 'Frequency', 'Recency']
rfm_df_scaled.head()
Out[39]:
Amount Frequency Recency
0 -0.693076 2.633495 16.345575
1 1.406289 -0.104656 -0.047962
2 1.437092 0.090293 -0.050877
3 -0.418216 -0.069211 -0.071516
4 -0.676490 -0.078072 -0.088790
In [40]:
kmeans = KMeans(n_clusters=12, max_iter=1).fit(rfm_df_scaled)
rfm["Kmeans_cluster"] = kmeans.labels_+1
rfm.head()
Out[40]:
Recency Frequency Revenue r_quartile f_quartile m_quartile RFM_Score ClusterType Kmeans_cluster
Donor ID
1f4b5b6e68445c6c4a0509b3aca93f38 126 314 139170.7300 2 1 1 4 3.0 9
bd323208dc78b1c74b62664b768f3176 1012 5 544.0000 4 4 2 10 1.0 12
6dd6113f89f2766d3b0707ef2a46260c 1025 27 519.3500 4 1 2 7 2.0 12
344ad0a72366a27bd4926bbe5d066939 242 9 344.8265 2 2 2 6 3.0 2
3cf2607d4131cd654bf54a1f5a533357 133 8 198.7500 2 3 3 8 2.0 2
In [41]:
from sklearn.cluster import KMeans

sse={}

for k in range(1, 12):
    kmeans = KMeans(n_clusters=k, max_iter=1).fit(rfm_df_scaled)
#     tx_recency["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_ 
    
#print(sse)
    
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()

2.2. Take a look at the cluster centers given by KMeans after prediction. Are they identical to mean value of each cluster ? If yes, why do you think so? If no, why do you think so? In case the answer is no, please try to change the parameters of the function so the cluster center and the mean value of clusters will be identical. Store and report the new results. (0.75 points)

In [42]:
rfm.groupby('Kmeans_cluster')['RFM_Score'].describe()
Out[42]:
count mean std min 25% 50% 75% max
Kmeans_cluster
1 27158.0 8.867663 1.927740 5.0 7.0 9.0 10.00 12.0
2 74817.0 6.607776 2.273300 3.0 5.0 7.0 8.00 11.0
3 15.0 3.333333 0.899735 3.0 3.0 3.0 3.00 6.0
4 6.0 3.000000 0.000000 3.0 3.0 3.0 3.00 3.0
5 11055.0 10.123202 1.777068 6.0 9.0 10.0 12.00 12.0
6 2.0 3.000000 0.000000 3.0 3.0 3.0 3.00 3.0
7 54.0 3.240741 0.698656 3.0 3.0 3.0 3.00 6.0
8 1.0 3.000000 NaN 3.0 3.0 3.0 3.00 3.0
9 561.0 3.647059 1.047205 3.0 3.0 3.0 4.00 9.0
10 13.0 3.538462 0.967418 3.0 3.0 3.0 4.00 6.0
11 336.0 3.404762 0.804748 3.0 3.0 3.0 3.25 6.0
12 15047.0 9.729647 1.858017 6.0 8.0 10.0 11.00 12.0

Answer:

Based on table above, all of the cluster centers are identical to mean value of each cluster.

The reason is because cluster center (or centroid) is the average values of all data points belong to the cluster itself

2.3. Compare the results you produced in exercise 1.7 with the results generated by KMeans i.e number of elements in each class, mean and std values for elements in each class. Which approach do you think is better or are they similar ? Hint: Plot histogram and/or boxplot (0.75 points)

In [43]:
km = KMeans(n_clusters=4, max_iter=1)
km.fit(rfm_df)
rfm["Kmeans_cluster"] = kmeans.labels_+1
rfm.head()
Out[43]:
Recency Frequency Revenue r_quartile f_quartile m_quartile RFM_Score ClusterType Kmeans_cluster
Donor ID
1f4b5b6e68445c6c4a0509b3aca93f38 126 314 139170.7300 2 1 1 4 3.0 10
bd323208dc78b1c74b62664b768f3176 1012 5 544.0000 4 4 2 10 1.0 3
6dd6113f89f2766d3b0707ef2a46260c 1025 27 519.3500 4 1 2 7 2.0 3
344ad0a72366a27bd4926bbe5d066939 242 9 344.8265 2 2 2 6 3.0 1
3cf2607d4131cd654bf54a1f5a533357 133 8 198.7500 2 3 3 8 2.0 8
In [44]:
# plt.hist(rfm.ClusterType, bins = 4)
# plt.show()

# plot_data = [
#     go.Histogram(
#         x=rfm.ClusterType
#     )
# ]

# plot_layout = go.Layout(
#         title='Distribution of ClusterType'
#     )

# fig = go.Figure(data=plot_data, layout=plot_layout)
# pyoff.iplot(fig)


fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Histogram(x=rfm.ClusterType, name="Exercise 1.7"),
    row=1, col=1
)

fig.add_trace(
    go.Histogram(x=rfm.Kmeans_cluster, name="Kmeans"),
    row=1, col=2
)

fig.update_layout(height=600, width=800, title_text="Comparison of the Cluster Results")
# fig.show()
pyoff.iplot(fig)
In [45]:
rfm.ClusterType.describe()
Out[45]:
count    129065.000000
mean          1.966513
std           0.976952
min           1.000000
25%           1.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: ClusterType, dtype: float64
In [46]:
rfm.Kmeans_cluster.describe()
Out[46]:
count    129065.000000
mean          5.236090
std           3.359834
min           1.000000
25%           1.000000
50%           8.000000
75%           8.000000
max          11.000000
Name: Kmeans_cluster, dtype: float64

Answer:

  • From the result of the exercise 1.7, it showed relatively balanced value for each class, with cluster type 1 is the highest one with almost 50% of total number of elements
  • From the result of the Kmeans algorithm, it showed cluster number 1 give more than 70% of total number of elements

2.4. To overcome the limitations of KMeans many other clustering algorithms are used, one of them is DBSCAN. Mention at least 2 advantages of DBSCAN over KMeans and 2 disadvantages of DBSCAN.(0.30 points)

Answer:

Advantages:

  • DBSCAN can resolves some problems from KMeans (which doesn't care about how dense the data present), by working with the density of points
  • Great when handling with outliers in the dataset

Disadvantages:

  • It doesn't work well with the cluster which has varying densities
  • Struggles with high dimensionality data

2.5. Use DBSCAN with its default parameters to generate clusters based on RFM values. How many clusters did it create? Was the value same as you selected in the exercise 1.5?(0.50 points)

In [47]:
rfm_df_scaled.count()
Out[47]:
Amount       129065
Frequency    129065
Recency      129065
dtype: int64
In [48]:
# Because DBSCAN had struggles with high dimensionality data, we subset the 120k+ rows of data into 50k rows
rfm_df_scaled_50k = rfm_df_scaled.head(50000)
rfm_df_scaled_50k.count()
Out[48]:
Amount       50000
Frequency    50000
Recency      50000
dtype: int64
In [49]:
# We do the same for rfm dataframe to make it equal for comparison
rfm_50k = rfm.head(50000)
rfm_50k.count()
Out[49]:
Recency           50000
Frequency         50000
Revenue           50000
r_quartile        50000
f_quartile        50000
m_quartile        50000
RFM_Score         50000
ClusterType       50000
Kmeans_cluster    50000
dtype: int64
In [50]:
dbscan = DBSCAN()
# dbscan.fit(rfm_df_scaled)
# dbscan.fit(rfm['RFM_Score'])
dbscan.fit(rfm_df_scaled_50k)
Out[50]:
DBSCAN(algorithm='auto', eps=0.5, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=5, n_jobs=None, p=None)
In [51]:
rfm_50k['DBSCAN_Cluster'] = dbscan.labels_
rfm_50k.head()
Out[51]:
Recency Frequency Revenue r_quartile f_quartile m_quartile RFM_Score ClusterType Kmeans_cluster DBSCAN_Cluster
Donor ID
1f4b5b6e68445c6c4a0509b3aca93f38 126 314 139170.7300 2 1 1 4 3.0 10 -1
bd323208dc78b1c74b62664b768f3176 1012 5 544.0000 4 4 2 10 1.0 3 0
6dd6113f89f2766d3b0707ef2a46260c 1025 27 519.3500 4 1 2 7 2.0 3 0
344ad0a72366a27bd4926bbe5d066939 242 9 344.8265 2 2 2 6 3.0 1 0
3cf2607d4131cd654bf54a1f5a533357 133 8 198.7500 2 3 3 8 2.0 8 0
In [52]:
rfm_50k['DBSCAN_Cluster'].unique()
Out[52]:
array([-1,  0,  1,  2,  3,  6,  4,  7,  5])

DBSCAN created 9 clusters, and it was not same with the cluster that I've selected in exercise 1.5

2.6. In 2.5, how many noisy data points were identified by the DBSCAN? (0.30 points)

In [53]:
# label = -1 means it's a noise data point (outlier)
# Reference: 
# https://medium.com/@agarwalvibhor84/lets-cluster-data-points-using-dbscan-278c5459bee5
# https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
rfm_50k[rfm_50k['DBSCAN_Cluster'] == -1]['RFM_Score'].count()
Out[53]:
247

Answer: 247 (using 50,000 rows subset dataframe)

2.7 Try different combinations of "eps" and "min_samples" parameters for DBSCAN in order to reach the same number of classes as in exercise 1.5 (at most you can have a difference of 10 classes). Report the combination that gave the best result (0.75 points). Why do you think it worked and the number of classes decreased compared to exercise 2.5? (0.25 points)

In [54]:
dbscan27 = DBSCAN(eps=0.3, min_samples=10)
# dbscan27 = DBSCAN()
print(dbscan27.fit(rfm_df_scaled_50k))

np.unique(dbscan27.labels_)
DBSCAN(algorithm='auto', eps=0.3, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=10, n_jobs=None, p=None)
Out[54]:
array([-1,  0,  1,  2,  3])

Answer:

  • DBSCAN works by determining the minimum number of points (min_samples) are close enough to one another to be considered as part of a same cluster.
  • When the min_samples parameter value was increased, it will decreased the number of cluster created
  • Epsilon (eps) value works as the threshold value for the distance between two points in data, if the distance below eps value, it considered as neighbour. By reducing eps value from 0.5 to 0.3, it will decreasing the number of cluster created.

Reference

3. Regression (2.9 points)

For this exercise we are going to use the Prices dataset which contains 74 columns. Each column explains a characteristic of houses in sale. The last column is describing their prices.

In [55]:
data = pd.read_csv("Prices.csv")
data.head()
Out[55]:
MSSubClass MSZoning LotFrontage LotArea Street LotShape LandContour Utilities LotConfig LandSlope ... EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 60 RL 65.0 8450 Pave Reg Lvl AllPub Inside Gtl ... 0 0 0 0 0 2 2008 WD Normal 208500
1 20 RL 80.0 9600 Pave Reg Lvl AllPub FR2 Gtl ... 0 0 0 0 0 5 2007 WD Normal 181500
2 60 RL 68.0 11250 Pave IR1 Lvl AllPub Inside Gtl ... 0 0 0 0 0 9 2008 WD Normal 223500
3 70 RL 60.0 9550 Pave IR1 Lvl AllPub Corner Gtl ... 272 0 0 0 0 2 2006 WD Abnorml 140000
4 60 RL 84.0 14260 Pave IR1 Lvl AllPub FR2 Gtl ... 0 0 0 0 0 12 2008 WD Normal 250000

5 rows × 75 columns

Before using the data we have to apply some preprocessing:

  • Apply one-hot encoding to categorical data types
  • Apply logarithmic transformation to the data
  • Replace negative infinite values with 0
In [56]:
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]

data = pd.get_dummies(X)
data["SalePrice"]=y

data = np.log(data)

data[data==-np.inf]=0

Now the data is ready for you to use in the next exercises.

3.1. Calculate the correlation between price and each feature. Which are the top 3 features that have the highest correlation with price? Is the correlation positive or negative? Explain what happens with the price when each of those 3 features change (consider only one feature at a time) and others are kept constant. (0.50 points)

In [57]:
data.head()
Out[57]:
MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 ... SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial SalePrice
0 4.094345 4.174387 9.041922 1.945910 1.609438 7.602401 7.602401 5.278115 6.559615 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.247694
1 2.995732 4.382027 9.169518 1.791759 2.079442 7.588830 7.588830 0.000000 6.885510 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.109011
2 4.094345 4.219508 9.328123 1.945910 1.609438 7.601402 7.601902 5.087596 6.186209 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.317167
3 4.248495 4.094345 9.164296 1.945910 1.609438 7.557473 7.585789 0.000000 5.375278 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 11.849398
4 4.094345 4.430817 9.565214 2.079442 1.609438 7.600902 7.600902 5.857933 6.484635 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.429216

5 rows × 271 columns

In [58]:
correlation = data.corr()
# print(correlation['SalePrice'].sort_values(ascending = False), '\n')
correlation[correlation['SalePrice'] >= 0]['SalePrice'].sort_values(ascending = False)
Out[58]:
SalePrice       1.000000
OverallQual     0.794007
GrLivArea       0.730255
GarageCars      0.659806
1stFlrSF        0.608947
FullBath        0.598520
YearBuilt       0.584442
YearRemodAdd    0.565590
GarageYrBlt     0.539621
TotRmsAbvGrd    0.539261
OpenPorchSF     0.460081
GarageArea      0.455197
MasVnrArea      0.418752
LotArea         0.399918
TotalBsmtSF     0.373009
LotFrontage     0.364108
WoodDeckSF      0.343269
Fireplaces      0.210056
BedroomAbvGr    0.209210
BsmtUnfSF       0.208103
BsmtFinSF1      0.207962
2ndFlrSF        0.180828
ScreenPorch     0.105909
PoolArea        0.069949
3SsnPorch       0.058838
MoSold          0.051392
BsmtFullBath    0.042886
OverallCond     0.008710
Name: SalePrice, dtype: float64

**Answer:**

  • Top 3 features with highest correlation with SalePrice:
    • OverallQual
    • GrLivArea
    • GarageCars
  • Correlation is positive
  • When each of these 3 features got a change of value, the SalePrice variable will change significantly

3.2. In this exercise we have to build a regression model using training data and then predict the price in test data. You are free to select features which you want to use for building the model. As the data has missing values we are asking you to try the following methods for dealing with the missing data:

a) mean imputation

b) median imputation

c) mode imputation

d) dropping missing values

For each of these cases report MAE, RMSE and R2. Which method works better ?

To get the training and test set, split the data into 20% test set and 80% training set using train_test_split function from scikit-learn. Keep the parameter random_state equal to 0. (1.50 points)

In [59]:
# Import Library
from sklearn.model_selection import train_test_split # for data splitting
from sklearn.metrics import mean_squared_error # for model evaluation
from sklearn.metrics import r2_score # model evaluation
from sklearn.metrics import median_absolute_error # model evaluation
from sklearn.linear_model import LinearRegression
In [60]:
# Select features which have correlation with SalePrice more than 50 percents
correlation[correlation['SalePrice'] >= 0.5]['SalePrice'].sort_values(ascending = False)
Out[60]:
SalePrice       1.000000
OverallQual     0.794007
GrLivArea       0.730255
GarageCars      0.659806
1stFlrSF        0.608947
FullBath        0.598520
YearBuilt       0.584442
YearRemodAdd    0.565590
GarageYrBlt     0.539621
TotRmsAbvGrd    0.539261
Name: SalePrice, dtype: float64

a. Use Mean Imputation

In [61]:
# Read csv
data = pd.read_csv("Prices.csv")
data.head()

# Data preprocessing
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
data = pd.get_dummies(X)
data["SalePrice"]=y
data = np.log(data)
data[data==-np.inf]=0

# Mean Imputation
data.fillna(data.mean(), inplace=True)
print(data.isnull().sum().sum()) # check total null value
0
In [62]:
# Try to change one of the highest correlated features
highest_corr_features = data[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
                             'FullBath', '1stFlrSF', 'YearRemodAdd', 'YearBuilt', 'GarageYrBlt', 'TotRmsAbvGrd']]
# highest_corr_features.head()
corr = highest_corr_features.corr()
corr.style.background_gradient(cmap='Blues')
Out[62]:
SalePrice OverallQual GrLivArea GarageCars GarageArea FullBath 1stFlrSF YearRemodAdd YearBuilt GarageYrBlt TotRmsAbvGrd
SalePrice 1.000000 0.794007 0.730255 0.659806 0.455197 0.598520 0.608947 0.565590 0.584442 0.499106 0.539261
OverallQual 0.794007 1.000000 0.601148 0.556868 0.397636 0.531549 0.431252 0.536980 0.553431 0.480274 0.426613
GrLivArea 0.730255 0.601148 1.000000 0.496139 0.298299 0.652849 0.545984 0.311515 0.232931 0.238416 0.827690
GarageCars 0.659806 0.556868 0.496139 1.000000 0.530889 0.537545 0.441902 0.453795 0.556988 0.572417 0.363041
GarageArea 0.455197 0.397636 0.298299 0.530889 1.000000 0.228520 0.286945 0.207472 0.341475 0.157351 0.193529
FullBath 0.598520 0.531549 0.652849 0.537545 0.228520 1.000000 0.373942 0.451403 0.490966 0.487067 0.544117
1stFlrSF 0.608947 0.431252 0.545984 0.441902 0.286945 0.373942 1.000000 0.233819 0.277025 0.213175 0.414699
YearRemodAdd 0.565590 0.536980 0.311515 0.453795 0.207472 0.451403 0.233819 1.000000 0.589813 0.616155 0.196213
YearBuilt 0.584442 0.553431 0.232931 0.556988 0.341475 0.490966 0.277025 0.589813 1.000000 0.777647 0.105666
GarageYrBlt 0.499106 0.480274 0.238416 0.572417 0.157351 0.487067 0.213175 0.616155 0.777647 1.000000 0.138227
TotRmsAbvGrd 0.539261 0.426613 0.827690 0.363041 0.193529 0.544117 0.414699 0.196213 0.105666 0.138227 1.000000
In [63]:
# drop our target variable
X = highest_corr_features.loc[:, highest_corr_features.columns != 'SalePrice'] 

# our target variable that we need to predict
y = highest_corr_features[['SalePrice']]

# split the data to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

# make an instance from the lr model
lin_reg_mod = LinearRegression()

# train the model - teach the model
lin_reg_mod.fit(X_train, y_train)

# predict unseen data (test dat)
pred = lin_reg_mod.predict(X_test)


# Evaluate the LR model
# MAE
test_set_mae = median_absolute_error(y_test, pred)
# RMSE
test_set_rmse = (np.sqrt(mean_squared_error(y_test, pred)))
#R^2
test_set_r2 = r2_score(y_test, pred)
#MSE
test_set_mse = (mean_squared_error(y_test, pred))

metric_values = [test_set_mse, test_set_rmse, test_set_mae,test_set_r2 ]
idx = ['MSE', 'RMSE', 'MAE', 'R2']
# pd.DataFrame(metric_values, index=idx)

df_mean = pd.DataFrame(metric_values, columns=['Mean Imputation'],index=idx)
df_mean
Out[63]:
Mean Imputation
MSE 0.032021
RMSE 0.178944
MAE 0.082128
R2 0.788500

b. Median Imputation

In [64]:
# Read csv
data = pd.read_csv("Prices.csv")
data.head()

# Data preprocessing
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
data = pd.get_dummies(X)
data["SalePrice"]=y
data = np.log(data)
data[data==-np.inf]=0

# Median Imputation
data.fillna(data.median(), inplace=True)
print(data.isnull().sum().sum()) # check total null value
0
In [65]:
# Try to change one of the highest correlated features
highest_corr_features = data[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
                             'FullBath', '1stFlrSF', 'YearRemodAdd', 'YearBuilt', 'GarageYrBlt', 'TotRmsAbvGrd']]
# highest_corr_features.head()
corr = highest_corr_features.corr()
corr.style.background_gradient(cmap='Blues')
Out[65]:
SalePrice OverallQual GrLivArea GarageCars GarageArea FullBath 1stFlrSF YearRemodAdd YearBuilt GarageYrBlt TotRmsAbvGrd
SalePrice 1.000000 0.794007 0.730255 0.659806 0.455197 0.598520 0.608947 0.565590 0.584442 0.493986 0.539261
OverallQual 0.794007 1.000000 0.601148 0.556868 0.397636 0.531549 0.431252 0.536980 0.553431 0.475576 0.426613
GrLivArea 0.730255 0.601148 1.000000 0.496139 0.298299 0.652849 0.545984 0.311515 0.232931 0.235328 0.827690
GarageCars 0.659806 0.556868 0.496139 1.000000 0.530889 0.537545 0.441902 0.453795 0.556988 0.567028 0.363041
GarageArea 0.455197 0.397636 0.298299 0.530889 1.000000 0.228520 0.286945 0.207472 0.341475 0.142180 0.193529
FullBath 0.598520 0.531549 0.652849 0.537545 0.228520 1.000000 0.373942 0.451403 0.490966 0.485163 0.544117
1stFlrSF 0.608947 0.431252 0.545984 0.441902 0.286945 0.373942 1.000000 0.233819 0.277025 0.210374 0.414699
YearRemodAdd 0.565590 0.536980 0.311515 0.453795 0.207472 0.451403 0.233819 1.000000 0.589813 0.614295 0.196213
YearBuilt 0.584442 0.553431 0.232931 0.556988 0.341475 0.490966 0.277025 0.589813 1.000000 0.773925 0.105666
GarageYrBlt 0.493986 0.475576 0.235328 0.567028 0.142180 0.485163 0.210374 0.614295 0.773925 1.000000 0.136364
TotRmsAbvGrd 0.539261 0.426613 0.827690 0.363041 0.193529 0.544117 0.414699 0.196213 0.105666 0.136364 1.000000
In [66]:
# drop our target variable
X = highest_corr_features.loc[:, highest_corr_features.columns != 'SalePrice'] 

# our target variable that we need to predict
y = highest_corr_features[['SalePrice']]

# split the data to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

# make an instance from the lr model
lin_reg_mod = LinearRegression()

# train the model - teach the model
lin_reg_mod.fit(X_train, y_train)

# predict unseen data (test dat)
pred = lin_reg_mod.predict(X_test)


# Evaluate the LR model
# MAE
test_set_mae = median_absolute_error(y_test, pred)
# RMSE
test_set_rmse = (np.sqrt(mean_squared_error(y_test, pred)))
#R^2
test_set_r2 = r2_score(y_test, pred)
#MSE
test_set_mse = (mean_squared_error(y_test, pred))

metric_values = [test_set_mse, test_set_rmse, test_set_mae,test_set_r2 ]
idx = ['MSE', 'RMSE', 'MAE', 'R2']
# pd.DataFrame(metric_values, index=idx)

df_median = pd.DataFrame(metric_values, columns=['Median Imputation'],index=idx)
df_median
Out[66]:
Median Imputation
MSE 0.032021
RMSE 0.178945
MAE 0.082052
R2 0.788498

c. Mode Imputation

In [67]:
# Read csv
data = pd.read_csv("Prices.csv")
data.head()

# Data preprocessing
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
data = pd.get_dummies(X)
data["SalePrice"]=y
data = np.log(data)
data[data==-np.inf]=0

# Mode Imputation
data.fillna(data.mode(), inplace=True)
print(data.isnull().sum().sum()) # check total null value
348

From total null value above, mode imputation didn't work normally. Because of that reason, I decided not to use this Mode Imputation

d. Drop Missing Values

In [68]:
# Read csv
data = pd.read_csv("Prices.csv")
data.head()

# Data preprocessing
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
data = pd.get_dummies(X)
data["SalePrice"]=y
data = np.log(data)
data[data==-np.inf]=0

# Drop Missing Values
data = data.dropna()
print(data.isnull().sum().sum()) # check total null value
0
In [69]:
# Try to change one of the highest correlated features
highest_corr_features = data[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
                             'FullBath', '1stFlrSF', 'YearRemodAdd', 'YearBuilt', 'GarageYrBlt', 'TotRmsAbvGrd']]
# highest_corr_features.head()
corr = highest_corr_features.corr()
corr.style.background_gradient(cmap='Blues')
Out[69]:
SalePrice OverallQual GrLivArea GarageCars GarageArea FullBath 1stFlrSF YearRemodAdd YearBuilt GarageYrBlt TotRmsAbvGrd
SalePrice 1.000000 0.808367 0.723988 0.640013 0.619982 0.612034 0.609570 0.592395 0.591396 0.561320 0.549219
OverallQual 0.808367 1.000000 0.617099 0.545382 0.503332 0.572078 0.462969 0.566615 0.581846 0.542976 0.441084
GrLivArea 0.723988 0.617099 1.000000 0.494358 0.466619 0.641429 0.529925 0.315826 0.240221 0.269769 0.829797
GarageCars 0.640013 0.545382 0.494358 1.000000 0.859572 0.554726 0.449921 0.470517 0.545697 0.621017 0.389059
GarageArea 0.619982 0.503332 0.466619 0.859572 1.000000 0.487025 0.492210 0.422191 0.512051 0.647654 0.354434
FullBath 0.612034 0.572078 0.641429 0.554726 0.487025 1.000000 0.373064 0.480371 0.525647 0.519348 0.534027
1stFlrSF 0.609570 0.462969 0.529925 0.449921 0.492210 0.373064 1.000000 0.269182 0.296327 0.266345 0.404910
YearRemodAdd 0.592395 0.566615 0.315826 0.470517 0.422191 0.480371 0.269182 1.000000 0.620186 0.643778 0.186469
YearBuilt 0.591396 0.581846 0.240221 0.545697 0.512051 0.525647 0.296327 0.620186 1.000000 0.820829 0.131952
GarageYrBlt 0.561320 0.542976 0.269769 0.621017 0.647654 0.519348 0.266345 0.643778 0.820829 1.000000 0.170089
TotRmsAbvGrd 0.549219 0.441084 0.829797 0.389059 0.354434 0.534027 0.404910 0.186469 0.131952 0.170089 1.000000
In [70]:
# drop our target variable
X = highest_corr_features.loc[:, highest_corr_features.columns != 'SalePrice'] 

# our target variable that we need to predict
y = highest_corr_features[['SalePrice']]

# split the data to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

# make an instance from the lr model
lin_reg_mod = LinearRegression()

# train the model - teach the model
lin_reg_mod.fit(X_train, y_train)

# predict unseen data (test dat)
pred = lin_reg_mod.predict(X_test)


# Evaluate the LR model
# MAE
test_set_mae = median_absolute_error(y_test, pred)
# RMSE
test_set_rmse = (np.sqrt(mean_squared_error(y_test, pred)))
#R^2
test_set_r2 = r2_score(y_test, pred)
#MSE
test_set_mse = (mean_squared_error(y_test, pred))

metric_values = [test_set_mse, test_set_rmse, test_set_mae,test_set_r2 ]
idx = ['MSE', 'RMSE', 'MAE', 'R2']
# pd.DataFrame(metric_values, index=idx)

df_dropNA = pd.DataFrame(metric_values, columns=['Drop Missing Values'],index=idx)
df_dropNA
Out[70]:
Drop Missing Values
MSE 0.037981
RMSE 0.194887
MAE 0.097126
R2 0.734573

Merge all three model evalutation dataframes

In [71]:
df_merged = df_mean.merge(df_median, left_index=True, right_index=True)
df_merged = df_merged.merge(df_dropNA, left_index=True, right_index=True)
df_merged
Out[71]:
Mean Imputation Median Imputation Drop Missing Values
MSE 0.032021 0.032021 0.037981
RMSE 0.178944 0.178945 0.194887
MAE 0.082128 0.082052 0.097126
R2 0.788500 0.788498 0.734573

**Answer:** Median Imputation

In [72]:
## Place here the values for MAE, RMSE, R^2 received from the best method
mae_best = 0.082052
rmse_best = 0.178945
r2_best = 0.788498

3.3 From 3.2 keep the best method to deal with missing values and use PCA to reduce the number of features to 34 components. Keep random_state for PCA equal to 0. (0.30 points)

Keep using median imputation to deal with missing values

In [73]:
# Read csv
data = pd.read_csv("Prices.csv")
data.head()

# Data preprocessing
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
data = pd.get_dummies(X)
data["SalePrice"]=y
data = np.log(data)
data[data==-np.inf]=0

# Median Imputation
data.fillna(data.median(), inplace=True)
In [74]:
x = data.loc[:, data.columns != 'SalePrice'] # features 
y = data[['SalePrice']] # target / label
In [75]:
x.shape
Out[75]:
(1460, 270)
In [76]:
y.shape
Out[76]:
(1460, 1)
In [77]:
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn import linear_model
from sklearn.metrics import make_scorer
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn
In [78]:
# features and target
X = x
y = y

# convert features to numpy array
X = X.to_numpy()

# n_components: Number of components (features) to keep.

# whiten: When True (False by default) the components_ vectors are multiplied by the square root of n_samples
# and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.

# svd (Singular Value Decomposition): project data to a lower dimensional space.

# get an instance from PCA
pca = PCA(n_components=34,whiten=True,svd_solver='randomized',random_state=0)

# fitting or teaching PCA 
pca = pca.fit(X)

# generate new features based on PCA technique by using transform()
dataPCA = pca.transform(X)

3.4. What percentage of the variance is explained by the first component ? (0.30 points)

In [79]:
df_dataPCA = pd.DataFrame(dataPCA)
df_dataPCA_34 = df_dataPCA.head(34)
df_dataPCA_34
Out[79]:
0 1 2 3 4 5 6 7 8 9 ... 24 25 26 27 28 29 30 31 32 33
0 0.400974 1.223118 -0.910404 -1.648483 0.139584 0.435578 -0.501645 0.243608 0.316857 -0.501486 ... -0.158593 -0.331980 1.026201 -1.012958 0.015676 -0.008973 0.272202 0.003541 -0.583692 0.291445
1 -1.059000 -0.120889 -0.097151 1.528199 -0.218984 -1.361843 -0.571234 -0.416827 -0.235038 -0.431510 ... 0.289730 1.006162 -0.421546 -0.115951 0.057855 0.139912 0.001934 0.012801 0.894702 -1.496493
2 0.499830 1.138785 -0.737011 -1.648709 0.041533 0.065662 -0.241826 -0.029876 0.373403 -0.596759 ... -0.288695 -0.523890 -0.377513 1.122547 -0.094027 0.191684 0.582241 -0.009410 -0.787535 0.346772
3 0.938247 -0.145813 -1.420151 -0.658421 -0.848852 -0.737478 1.159800 2.221904 -1.111513 -0.318143 ... -0.761815 -1.176662 0.413011 0.485354 -0.020799 -0.362870 -0.250108 0.051326 0.836033 0.165281
4 0.408861 2.169936 -0.115561 -0.090386 0.141905 -0.196373 -0.201000 0.327852 0.127142 -0.191930 ... -0.395345 -0.545462 0.348828 -0.252183 -0.006241 -0.319694 -0.019098 0.073775 0.123294 0.179951
5 0.383819 0.643082 -1.385256 0.774144 -0.773701 0.047457 -0.945066 0.209817 -0.828595 0.082453 ... -1.039270 1.040346 0.752677 -0.446130 -1.349881 -0.079257 0.048935 0.418163 -1.379874 -0.536765
6 -1.229837 1.242792 0.802037 0.208650 -0.232901 -0.179362 -0.313493 0.657992 -0.020233 0.014807 ... -0.436356 -0.560762 0.444652 0.314703 -0.111362 -0.011551 -0.094308 -0.057172 -0.053930 0.369538
7 0.348080 1.881109 -0.797479 0.382790 0.457095 0.489994 2.493813 2.770129 -0.682504 1.164009 ... 2.931168 0.242677 -0.274590 0.475419 -0.283180 -0.318315 0.192169 0.220432 0.009326 -1.199703
8 1.649848 -0.536223 0.129717 1.181699 0.844723 -1.464962 1.348128 1.297599 -1.064015 -0.086900 ... 3.061848 2.193115 4.982909 2.004709 -2.292259 -1.328325 -0.995927 0.695916 1.667349 -1.223156
9 -0.898924 -0.846916 -0.747049 -0.400070 -0.615770 -0.336070 -0.850029 -0.158066 0.353362 -0.365661 ... 2.702377 1.421406 2.881049 2.578727 -2.727677 -0.610971 0.075845 0.150896 0.248602 -0.687227
10 -0.999668 -0.970582 -0.783719 -0.335214 -0.260341 -0.714998 -0.688375 -0.537596 -0.067794 -0.820846 ... -0.286958 -1.042434 -0.020593 0.799046 -0.307212 0.255459 0.173693 -0.113515 -0.295916 -0.275959
11 0.304754 2.007712 -0.501975 -0.013618 0.663650 -0.330023 -0.419669 0.176489 0.048610 -0.281806 ... 3.463514 -0.633753 1.362642 -1.941198 -0.115125 -0.295489 -0.465686 0.041447 0.002069 0.317575
12 -1.051343 -0.302167 -0.283340 1.148768 -0.159264 -0.977522 -0.464698 -1.588229 -2.001643 2.569676 ... -0.548899 -0.048852 0.353323 1.248263 -0.513497 0.056099 0.261868 0.160866 -0.154304 -1.082917
13 -0.059777 0.368567 2.650993 0.095878 0.471063 0.283332 0.293424 0.039715 -0.156419 -0.216208 ... 0.443537 0.334330 -0.032698 -0.195807 0.253163 -0.254417 -0.430027 -0.081495 -0.186370 0.287834
14 -0.894987 0.232398 0.289872 -1.771392 -0.480814 -0.068762 1.204569 3.106548 -0.510942 0.485299 ... -1.022971 -0.362993 0.518334 1.019528 -0.185131 0.166808 0.369175 0.057937 0.017251 -1.112792
15 0.221495 -0.732081 1.727166 0.793912 -1.281914 0.743502 -0.244456 0.110959 -0.305923 -0.281093 ... 0.241830 -0.001683 1.266405 -0.074569 0.020143 -0.288952 -0.952323 -0.087760 1.353240 1.427736
16 -1.106112 -0.130893 0.083335 -1.327415 0.975517 -0.817293 -0.162628 -0.394435 -0.375472 -0.053444 ... -0.732861 0.087513 0.444470 0.371209 -0.495623 -0.187387 -0.231296 -0.025045 0.605132 -0.659009
17 0.038290 -2.109077 -0.014064 0.378731 1.924423 2.606942 -2.632852 0.423385 -2.386939 -0.335035 ... -0.689380 1.467813 0.859037 1.212673 -1.313598 -0.215552 0.120898 0.234031 -0.365260 -0.310163
18 -0.791967 -0.476934 -0.212623 -0.671652 -1.971635 0.241229 -0.405600 0.193871 -0.137635 -0.653132 ... -0.377330 -0.065561 0.292457 -0.236372 -0.086364 -0.035882 -0.238710 0.008939 -2.381423 0.221338
19 -0.848449 -1.026577 -0.482024 -0.406633 -0.433463 -1.036728 -0.406177 -0.721267 0.251003 -0.706589 ... -0.727893 0.006337 -0.585957 0.085963 0.063402 0.270815 0.965636 0.045763 -0.254423 -0.807475
20 1.505541 1.529435 1.544588 0.068078 0.435278 0.591814 0.201940 0.103090 0.039864 -0.149643 ... 0.349456 -0.021470 -0.297850 -0.339598 0.231046 -0.037847 0.609540 0.003253 0.293159 0.086151
21 0.290086 -2.057265 0.726483 -0.238744 0.514057 -0.798269 1.201904 1.551546 -0.972902 -0.120996 ... -0.362344 -1.754898 0.596898 -0.174171 0.404466 0.144396 -0.117404 -0.260068 1.219005 -1.276000
22 -0.011767 0.506864 2.765052 0.038177 -0.090639 0.618846 0.301273 0.388962 0.139695 0.145254 ... -0.251150 -0.306248 -0.065911 0.570023 0.067741 -0.012790 0.128000 -0.138499 -0.004090 -0.083247
23 -0.880074 0.186185 0.045232 0.932325 -1.622079 0.153487 -0.782358 0.577099 -0.248064 -0.172543 ... 0.159704 0.705845 -0.820860 -1.038899 0.304573 -0.228650 -0.291390 -0.165932 0.128194 -1.070154
24 -0.942802 0.088389 0.085341 1.975013 -1.011912 1.909923 1.925611 -0.366764 1.233657 -0.021700 ... -0.149408 0.888106 0.561471 -0.254463 0.227935 -0.266135 0.044897 0.238765 -0.990464 0.618087
25 -0.007153 -0.132713 2.273981 -1.796673 0.278809 0.801005 0.439319 -0.064505 -0.022806 -0.630760 ... 0.155435 -0.386320 0.049417 0.055386 0.189977 -0.020441 -0.413437 -0.195247 -0.322246 0.336324
26 -0.990557 -0.035789 -0.069500 1.798288 -0.723689 1.634968 1.808368 -0.617959 0.746207 -0.526034 ... 0.544304 0.699761 0.335194 0.678342 0.281870 -0.550019 -0.561064 0.178159 -0.458307 1.166368
27 -1.127181 0.497748 0.324767 -1.699997 -0.459856 0.084359 -0.094187 0.283520 0.049294 -0.616668 ... -0.220030 -0.447459 0.086273 0.175838 0.045317 0.029906 -0.223160 -0.106988 -0.191494 0.547697
28 -0.990322 0.437289 0.159089 1.229110 -1.993781 0.178552 -0.697952 0.846277 -0.034301 0.266713 ... 1.410712 1.028563 0.045962 -0.238853 -1.130144 0.501079 1.382270 0.138195 -0.802193 0.945683
29 0.189213 -1.549223 1.128906 1.107365 0.679847 -0.899161 0.764429 1.410044 -0.829203 0.223970 ... 0.703950 1.498434 1.951878 -2.024010 0.089043 -0.141715 -1.248686 0.250792 2.004998 -0.583574
30 1.778598 -0.837017 -0.131419 -0.509884 -0.491795 0.112696 1.113714 2.012543 -0.597527 0.161811 ... -0.058294 -0.204072 -0.313025 -0.356756 0.207500 0.204850 1.613854 0.108610 1.285746 0.432838
31 0.260599 -1.343866 1.346228 -0.510915 -1.363811 0.684372 -0.035113 -0.187776 0.254883 -0.326348 ... -0.297974 0.284659 -0.337088 -0.003015 0.107687 0.180547 0.512676 -0.042876 -1.371429 1.366400
32 0.240040 -1.360318 1.325482 -0.482511 -1.144278 0.493887 0.041066 -0.423188 -0.106760 -0.716030 ... 0.422358 -1.501702 1.372630 -0.585698 0.193809 0.037762 -1.110688 -0.221126 -0.582201 0.172418
33 -0.893839 -0.519589 -0.409047 -0.623173 -1.675884 -0.066446 -0.424457 0.040695 -0.009419 -0.622434 ... -0.723936 -0.064085 -1.860408 1.439786 -0.142105 0.189442 1.226142 -0.089765 0.854928 -0.676868

34 rows × 34 columns

In [80]:
# Count proportion of variance by first component
df_dataPCA_34[0][0] / sum(np.diagonal(df_dataPCA_34))

# Reference
# https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained/22571
Out[80]:
-0.07244317524702533

**Answer:** 7.24%

3.5. Use the new components derived from PCA to predict the house pricing. Keep the ratio of test and train set to 20/80 and the random_state equal to 0. Report MAE, RMSE and R2 (0.30 points)

In [81]:
train_pca , test_pca, train_y_orig, test_y_orig = train_test_split(
        dataPCA,
        y,
        test_size=0.20,
        random_state=0)
In [82]:
# # manual method to split data (original data)
# train_original = X[:1200]
# test_original = X[1200:]

# train_y_orig = y[:1200]
# test_y_orig = y[1200:]

# # Split traing and test
# train_pca = dataPCA[:1200]
# test_pca = dataPCA[1200:]
In [83]:
def get_mae(x_train, y_train, x_test,y_test):
    results={}
    
    def mae_model(clf):  
        # train the model - teach the model
        clf.fit(x_train, y_train)

        # predict unseen data (test dat)
        pred = clf.predict(x_test)

        # R2
        mae_val_score = median_absolute_error(y_test, pred)

        scores=[mae_val_score.mean()]
        return scores

    clf = linear_model.LinearRegression()
    results["MAE"]=mae_model(clf)

    
    results = pd.DataFrame.from_dict(results,orient='index')
    results.columns=["MAE Score"] 
#     results.plot(kind="bar",title="Model Scores")
#     axes = plt.gca()
#     axes.set_ylim([0.05,0.1])
    return results

# PCA model
get_mae(train_pca, train_y_orig, test_pca,test_y_orig)
Out[83]:
MAE Score
MAE 0.074581
In [84]:
from sklearn import metrics
def get_rmse(x_train, y_train, x_test,y_test):
    results={}
    
    def rmse_model(clf):  
        # train the model - teach the model
        clf.fit(x_train, y_train)

        # predict unseen data (test dat)
        pred = clf.predict(x_test)

        # RMSE
        rmse_val_score = np.sqrt(mean_squared_error(y_test, pred))

        scores=[rmse_val_score.mean()]
        return scores

    clf = linear_model.LinearRegression()
    results["RMSE"]=rmse_model(clf)

    
    results = pd.DataFrame.from_dict(results,orient='index')
    results.columns=["RMSE Score"] 
# #     results.plot(kind="bar",title="Model Scores")
# #     axes = plt.gca()
# #     axes.set_ylim([0.1,0.2])
    return results

# PCA model
get_rmse(train_pca, train_y_orig, test_pca,test_y_orig)
Out[84]:
RMSE Score
RMSE 0.174036
In [85]:
def get_r2(x_train, y_train, x_test,y_test):
    results={}
    
    def r2_model(clf):  
        # train the model - teach the model
        clf.fit(x_train, y_train)

        # predict unseen data (test dat)
        pred = clf.predict(x_test)

        # R2
        r2_val_score = r2_score(y_test, pred)

        scores=[r2_val_score.mean()]
        return scores

    clf = linear_model.LinearRegression()
    results["R-square"]=r2_model(clf)

    
    results = pd.DataFrame.from_dict(results,orient='index')
    results.columns=["R Square Score"]
#     results.plot(kind="bar",title="Model Scores")
#     axes = plt.gca()
#     axes.set_ylim([0.5,1])
    return results

# PCA model
get_r2(train_pca, train_y_orig, test_pca,test_y_orig)
Out[85]:
R Square Score
R-square 0.799944
In [86]:
## Calculate the values for MAE, RMSE, R^2 received after applying PCA

mae_pca = 0.074581
rmse_pca = 0.174036
r2_pca = 0.799944
In [87]:
print("MAE difference after PCA: ", mae_best-mae_pca)
print("RMSE difference after PCA: ", rmse_best-rmse_pca)
print("R2 difference after PCA: ", r2_best-r2_pca)
MAE difference after PCA:  0.007471000000000005
RMSE difference after PCA:  0.004908999999999997
R2 difference after PCA:  -0.011445999999999956

How long did it take you to solve the homework?

  • Please answer as precisely as you can. It does not affect your points or grade in any way. It is okay, if it took 0.5 hours or 24 hours. The collected information will be used to improve future homeworks.

**Answer:**

(please change X in the next cell into your estimate)

17 hours

What is the level of difficulty for this homework?

you can put only number between $0:10$ ($0:$ easy, $10:$ difficult)

**Answer:** 7